The Data Mining Process Overview
The full data mining process includes several stages:
- Data Selection: Choosing relevant data sources.
- Data Preprocessing: Cleaning, transforming, and preparing data (handling missing values, outliers, etc.).
- Data Mining: Applying algorithms to extract patterns/models (the focus of most courses).
- Interpretation/Evaluation: Analyzing results and validating them.
While the full process is important, this lecture emphasizes the Data Mining step. A common standard model of the full process is CRISP-DM (Cross-Industry Standard Process for Data Mining), which stresses that the stages are iterative rather than strictly sequential.


Four Key Elements of Data Mining Algorithms
Every data mining algorithm revolves around four interconnected elements:
- Task Specification: What are we trying to achieve?
- Knowledge Representation: How do we represent the discovered knowledge?
- Learning Technique: How do we search for and score the best model?
- Prediction and/or Interpretation: How do we use and understand the results?
1. Task Specification
This defines the goal of the analysis. There are four classical types of data mining tasks:
A. Exploratory Data Analysis (EDA)
Goal: Explore data without a specific hypothesis to summarize characteristics. Mainly uses visualization.
Example: Restaurant Tip Analysis
- Basic histogram of tips: Shows mode around $2, right-skewed distribution (few high tips).
- Fine-grained histogram: Spikes at whole/half dollars (people tip in 50-cent increments).
- Scatter plot (bill vs. tip): Linear relationship, ~16% average tip rate.
- Segmented: Less variance in female parties, more in smoking parties.
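The two histogram views and the average tip rate can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's analysis: the `records` list is made-up data, amounts are stored in cents to avoid floating-point binning issues, and the helper names `histogram` and `overall_tip_rate` are hypothetical.

```python
from collections import Counter

# Made-up (bill, tip) pairs in cents, for illustration only (not the lecture's data).
records = [(1000, 150), (2000, 300), (2500, 400), (1250, 200), (4000, 650), (1500, 200)]


def histogram(values_cents, bin_width_cents):
    """Count values per bin; each key is the lower edge of a bin, in cents."""
    counts = Counter((v // bin_width_cents) * bin_width_cents for v in values_cents)
    return dict(sorted(counts.items()))


def overall_tip_rate(records):
    """Total tips divided by total bills."""
    total_bill = sum(bill for bill, _ in records)
    total_tip = sum(tip for _, tip in records)
    return total_tip / total_bill


tips = [tip for _, tip in records]
coarse = histogram(tips, 100)  # $1 bins: the "basic" view
fine = histogram(tips, 50)     # 50-cent bins: the "fine-grained" view
rate = overall_tip_rate(records)

print(coarse)
print(fine)
print(f"overall tip rate: {rate:.1%}")
```

Changing only `bin_width_cents` is what turns the coarse view into the fine-grained one; the data and the counting logic stay the same.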


B. Predictive Modeling (Supervised Learning)
Goal: Build a model to predict a target variable from features.
- Classification: Discrete target (e.g., spam/not spam).
- Regression: Continuous target (e.g., house price).
Examples:
- Zestimate: Predict house price from features.
- Loan default: Predict default from income, criminal record, and employment.


C. Descriptive Modeling (Unsupervised Learning)
Goal: Summarize data structure without a target (e.g., clustering, density estimation).
Example: Video scene clustering using RGB values → Groups like “foosball table” (orange) vs. “bookshelf” (green).
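The notes do not say which clustering algorithm was used for the video example. As one possible sketch, the code below runs plain k-means (Lloyd's algorithm) on average RGB values per frame. The `frames` values, the choice of two clusters, and the initial centers are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, centers, iterations=10):
    """Plain k-means: alternate between assigning points and recomputing means."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            clusters[nearest_center(p, centers)].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    labels = [nearest_center(p, centers) for p in points]
    return centers, labels


# Made-up average (R, G, B) per frame: three orange-ish and three green-ish frames.
frames = [(230, 120, 30), (240, 130, 40), (220, 110, 20),
          (40, 160, 60), (50, 170, 70), (30, 150, 50)]

centers, labels = kmeans(frames, centers=[frames[0], frames[3]])
print(labels)
print(centers)
```

No target variable appears anywhere in the code: the groups come only from the structure of the RGB values, which is what makes this a descriptive task.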


D. Pattern Discovery
Goal: Find local patterns in subsets (e.g., association rules).
Key difference: Local (subsets) vs. global models.
Example: Beer → Diapers (only in certain transactions).
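To make "local" concrete, the sketch below scores the rule Beer → Diapers with support and confidence. These are the standard measures for association rules, although the notes do not name them. The `transactions` list is made-up data.

```python
# Made-up market baskets, for illustration only.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"milk", "bread"},
    {"diapers", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also
    contain the consequent. Assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)


rule_support = support({"beer", "diapers"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

The rule says nothing about the transactions that contain no beer, which is the sense in which a pattern describes a subset of the data rather than all of it.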


Task classification examples:
- Sales forecast: Predictive
- Customer segmentation: Descriptive
- Pregnant customer prediction: Predictive
- Beer & Diapers: Pattern Discovery
2. Knowledge Representation
Defines the format of models/patterns (the “hypothesis space”).
- Predictive: If-then rules, decision trees, linear/logistic regression.
- Example Rule: If income > $70k AND no criminal record → loan = yes
- Logistic Regression: log(P(Y=1|x)/(1-P(Y=1|x))) = β₀ + β·x
- Linear Regression: y = β₀ + β₁x₁ + ...
- Descriptive: Mixture models: f(x) = Σ_k w_k f_k(x; θ_k)
- Pattern: Association rules: X → Y
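The logistic regression form above models the log-odds, so a prediction has to be converted back into a probability. A small sketch of that step, assuming made-up coefficients (`beta0`, `beta`) and a hypothetical function name:

```python
import math


def logistic_probability(x, beta0, beta):
    """Invert the log-odds form: P(Y=1|x) = 1 / (1 + exp(-(beta0 + beta . x)))."""
    log_odds = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-log_odds))


# Log-odds of 0 means even odds, so the probability is 0.5.
p_even = logistic_probability([0.0, 0.0], beta0=0.0, beta=[1.0, 2.0])

# Log-odds of ln(3) means odds of 3:1, so the probability is 0.75.
p_three_to_one = logistic_probability([1.0], beta0=0.0, beta=[math.log(3)])

print(p_even, p_three_to_one)
```

The representation fixes the shape of the model; the values of β₀ and β are what the learning technique has to find.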
3. Learning Technique
Combines model space, scoring, and search.
- Model Space: Parameters (e.g., thresholds) and structure (e.g., which features).
- Scoring: Error rate (classification), squared error (regression), likelihood (descriptive).
- Search: Optimization (parameters), heuristics (structure).
Example: 1D classification with threshold.
Rule: If x > t then + else –
Search over all candidate values of t and pick the one with the lowest error on the sample.
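A minimal sketch of this search, assuming labels are the strings "+" and "-" and using made-up sample points. The error count only changes when t crosses a data point, so it is enough to try one threshold below the smallest value, one between each pair of neighboring values, and one above the largest.

```python
def best_threshold(xs, labels):
    """Exhaustive search for t in the rule: if x > t then "+" else "-".

    Returns the first threshold with the fewest errors and its error rate.
    """
    values = sorted(set(xs))
    candidates = (
        [values[0] - 1.0]
        + [(a + b) / 2 for a, b in zip(values, values[1:])]
        + [values[-1] + 1.0]
    )
    best_t, best_errors = None, None
    for t in candidates:
        # An error is any point where the rule's prediction disagrees with the label.
        errors = sum((x > t) != (y == "+") for x, y in zip(xs, labels))
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t, best_errors / len(xs)


# Made-up 1D sample with one point on the "wrong" side of any threshold.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
labels = ["-", "-", "+", "-", "+", "+"]

t, error_rate = best_threshold(xs, labels)
print(t, error_rate)
```

Here the model space is the single parameter t, the score is the error rate, and the search is exhaustive. Two thresholds tie on this sample (2.5 and 4.5), and the sketch keeps the first one it finds.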

Warning: The chosen threshold depends on the sample data; small or biased samples can lead to poor thresholds.
4. Prediction and Interpretation
Prediction: Apply model to new data.
Interpretation: Assess results for statistical significance, novelty, and interestingness.
Full Example: Spam Detection
- Task: Classification
- Features: Word frequencies
- Model: If %george < 0.6 AND %you > 1.5 → spam
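The rule above can be applied to a new message as follows. This sketch assumes that %word means the percentage of words in the message equal to that word; the notes do not define it, and the helper names are hypothetical.

```python
def word_percent(text, word):
    """Percentage of whitespace-separated words in text equal to word (case-insensitive)."""
    words = text.lower().split()
    if not words:
        return 0.0
    return 100.0 * words.count(word) / len(words)


def is_spam(pct_george, pct_you):
    """The rule from the notes: %george < 0.6 AND %you > 1.5 -> spam."""
    return pct_george < 0.6 and pct_you > 1.5


def classify(text):
    """Compute the two features, then apply the rule."""
    return is_spam(word_percent(text, "george"), word_percent(text, "you"))


print(classify("you have won a prize claim it now"))  # no "george", frequent "you"
print(classify("george please review the attached report"))
```

The feature computation corresponds to the knowledge representation (word frequencies feeding an if-then rule), and `classify` is the prediction step applied to unseen data.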